Use OnceCell to cache Python operations #12942
Merged
Conversation
Pull Request Test Coverage Report for Build 10392333478 (Coveralls)
raynelfss
previously approved these changes
Aug 13, 2024
Contributor
raynelfss
left a comment
This is sensible and, putting aside all the memory-saving benefits, it makes the code more readable as we don't need to constantly call `borrow` or `borrow_mut`. Hopefully we can bump to PyO3 0.22 and reorganize things soon.
Contributor
I had previously approved this PR but it seems to be failing CI now. Any clue as to why, @jakelishman?
3c765fd to
53d225f
Compare
We were previously using `RefCell` to handle the interior mutability for the Python-operation caches. This works fine, but has two problems:

- `RefCell<Py<PyAny>>` is two pointers wide
- we had to do heavier dynamic runtime checks for the borrow status on every access, including simple gets

This commit moves the caching to instead use `OnceCell`. The internal tracking for a `OnceCell<T>` is just whether an internal `Option<T>` is in the `Some` variant, and since `Option<Py<PyAny>>` will use the null-pointer optimisation for space, this means that `OnceCell<Py<PyAny>>` is only a single pointer wide. Since it's not permissible to take out a mutable reference to the interior, there is also no dynamic runtime checking---just the null-pointer check that the cell is initialised---so access is faster as well.

We can still clear the cache if we hold a mutable reference to the `OnceCell`, and since every method that can invalidate the cache also necessarily changes Rust-space, they certainly require mutable references to the cell, making it safe.

There is a risk here, though, in `OnceCell::get_or_init`. This function is not re-entrant; one cannot attempt the initialisation while another thread is initialising it. Typically this is not an issue, since the type isn't `Sync`; however, PyO3 allows types that are only `Send` and _not_ `Sync` to be put in `pyclass`es, which Python can then share between threads. This is usually safe due to the GIL mediating access to Python-space objects, so there are no data races. However, if the initialiser passed to `get_or_init` yields control at any point to the Python interpreter (such as by calling a method on a Python object), this gives it a chance to block the thread and allow another Python thread to run. The other thread could then attempt to enter the same cache initialiser, which is a violation of its contract.
PyO3's `GILOnceCell` can protect against this failure mode, but it is inconvenient for us to use, because it requires the GIL to be held to access the value at all, which is problematic during `Clone` operations. Instead, we make ourselves data-race safe by manually checking for initialisation, calculating the cache value ourselves, and then calling `set` or `get_or_init` passing the already created object. While the initialiser can still yield to the Python interpreter, the actual setting of the cell is now atomic in this sense, and thus safe. The downside is that more than one thread might do the same initialisation work, but given how comparatively rare the use of Python threads is, it's better to prioritise memory use and single-Python-threaded access times than one-off cache population in multiple Python threads.
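The pointer-width arithmetic above is easy to sanity-check in plain Rust. The sketch below uses `Box<u8>` as a stand-in for `Py<PyAny>` (both are non-null pointers, so wrapping them in `Option` gets the null-pointer optimisation):

```rust
use std::cell::{OnceCell, RefCell};
use std::mem::size_of;

fn main() {
    // Box<u8> stands in for Py<PyAny>: a non-null pointer, so wrapping it
    // in Option costs no extra space (null encodes the None case).
    assert_eq!(size_of::<Option<Box<u8>>>(), size_of::<usize>());
    // OnceCell<T> is just an UnsafeCell<Option<T>>, so it stays one
    // pointer wide here; the "is it initialised?" check is a null check.
    assert_eq!(size_of::<OnceCell<Box<u8>>>(), size_of::<usize>());
    // RefCell<T> carries an extra isize borrow flag alongside the value,
    // making it two pointers wide, plus a dynamic check on every borrow.
    assert_eq!(size_of::<RefCell<Box<u8>>>(), 2 * size_of::<usize>());
}
```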
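The "compute first, then publish" approach described above can be sketched in isolation. This is a hypothetical illustration with invented names, not the actual Qiskit code; the essential property is that the closure handed to `get_or_init` only moves an already-built value into place, so it can never yield control (for example, back to the Python interpreter) mid-initialisation:

```rust
use std::cell::OnceCell;

// Hypothetical sketch of the pattern (names invented for illustration).
fn cached_repr<'a>(cell: &'a OnceCell<String>, op_name: &str) -> &'a str {
    if let Some(cached) = cell.get() {
        return cached;
    }
    // In the real code this step may call into Python and yield the GIL;
    // crucially, it happens entirely outside the cell's initialiser.
    let value = format!("Gate({})", op_name);
    // The closure just moves the finished value in, so it cannot re-enter
    // the initialiser.  If another Python thread populated the cell while
    // we were computing, its value wins and ours is simply dropped.
    cell.get_or_init(|| value)
}

fn main() {
    let cell = OnceCell::new();
    assert_eq!(cached_repr(&cell, "h"), "Gate(h)");
    // The second call hits the cache; the argument is ignored.
    assert_eq!(cached_repr(&cell, "x"), "Gate(h)");
}
```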
53d225f to
a19d92f
Compare
Member
Author
Ray: it's ok, it's just that #12943 merged in the meantime that had a logical but not syntactic conflict. I've rebased this PR to account for the additional cache interaction added in that PR.
raynelfss
approved these changes
Aug 14, 2024
Contributor
raynelfss
left a comment
LGTM, thank you for the quick fixes!
Details and comments
`GILOnceCell` would likely be better here, but it doesn't derive `Clone`, nor would that make sense, since it uses the GIL to mediate even getter access. When we're able to move to PyO3 0.22 and disable the `py-clone` feature of it, we might be able to reorganise this a little bit to avoid needing to handle the GIL-atomic initialisation manually.

I measured this as saving 165MB (around 10%) in resident-set size from this benchmark:
over its parent commit. That's the same microbenchmark we used in #12730. That should have a (small) impact on copy and circuit-build performance due to needing to move around less memory, and access times on the Python operation cache should be smaller too, though that shouldn't be in the critical path.
The biggest memory problem remaining in the `CircuitData` model (by far) is the parameter storage at the moment - in this particular benchmark, we spend something around 559MB storing the parameters of the 9,000,000 parametric gates, of which 15% is malloc overhead (on macOS, at least, but probably most Unix-likes and not dissimilar on Windows) and, within the actual allocation, about 70% is just padding bytes.
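To see how padding of this kind accrues, here is a generic illustration (an invented stand-in, not Qiskit's actual `Param` type or storage layout) of an alignment-padded enum in a worst-case-sized inline buffer ending up mostly empty space:

```rust
use std::mem::size_of;

// A generic stand-in, not Qiskit's actual parameter type: an enum whose
// largest variant holds an f64 must be 8-byte aligned, so the one-byte
// discriminant drags in 7 bytes of padding in every stored value.
enum Param {
    Float(f64), // 8-byte payload
    Index(u32), // 4-byte payload; 4 more bytes sit unused in this variant
}

fn main() {
    // 1 tag byte + 7 padding bytes + 8 payload bytes.
    assert_eq!(size_of::<Param>(), 16);
    // An inline buffer sized for a worst case of, say, three parameters
    // costs 48 bytes per instruction even when a gate carries a single
    // f64: only 8 of those 48 bytes are useful data.
    let one_param: [Option<Param>; 3] = [Some(Param::Float(0.5)), None, None];
    assert_eq!(size_of::<[Option<Param>; 3]>(), 48);
    let _ = one_param;
}
```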